Hadoop and Spark Performance for the Enterprise by Andy Oram
Author: Andy Oram
Language: eng
Format: epub, mobi, pdf
Publisher: O'Reilly Media, Inc.
Published: 2016-07-22T04:00:00+00:00
Log resource usage, recording when a change to container limits was required, and display this information for future use by programmers and administrators
Now we can turn to distributed systems, explore why they have variable resource needs, and look at some solutions that improve performance.
Performance Variation in Distributed Processing
Hadoop and Spark jobs are launched, usually through YARN, with fixed resource limits. When organizations use in-house virtualization or a cloud provider, a job is launched inside a VM with specified resources. For instance, Microsoft Azure allows the user to specify the processor speed, the number of cores, the memory, and the available disk size for each job. Amazon Web Services also offers a variety of instance types (e.g., general purpose, compute optimized, memory optimized).
Hadoop uses cgroups, a Linux feature for isolating groups of processes and setting resource limits. cgroups can theoretically change some resources dynamically during a run, but are not used for that purpose by Hadoop or Spark. cgroups’ control over disk and network I/O resources is limited.
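To make the idea of a static cgroup limit concrete, here is a minimal sketch that reads a memory limit the way the kernel exposes it. It assumes a cgroup v2 hierarchy, where a group's directory contains a `memory.max` file holding either the word `max` or a byte count. The function names and the directory argument are hypothetical, and this is not how YARN itself configures or inspects cgroups.

```python
from pathlib import Path
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse the contents of a cgroup v2 memory.max file.

    The file holds either the literal string "max" (no limit set)
    or a byte count. Returns None when no limit is set.
    """
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def read_memory_limit(cgroup_dir: str) -> Optional[int]:
    """Read the memory limit of one cgroup.

    cgroup_dir is a hypothetical path supplied by the caller, for
    example a directory somewhere under /sys/fs/cgroup.
    """
    return parse_memory_max((Path(cgroup_dir) / "memory.max").read_text())
```

The value read this way stays whatever it was set to at launch unless something rewrites the file, which is the fixed-limit behavior described above.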
But as explained earlier, the resource needs of distributed processing can actually swing widely, just like those of operating system processes. There are various reasons for these shifts in resource needs.
First, an organization multitasks. In an attempt to reduce costs, it schedules multiple jobs on a physical or virtual system. Under favorable conditions, all jobs can run in a reasonable time and maximize the use of physical resources. But if two jobs spike in resource usage at the same time, one or both can suffer. The host system cannot determine that one has a higher priority and give it more resources.
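The effect of two coinciding spikes can be sketched with a toy model. The demand numbers, the capacity, and the proportional split below are all assumptions made for illustration; real hosts and schedulers divide resources in more complicated ways. The point is only that, without priority information, both jobs are cut back when their combined demand exceeds what the host has.

```python
def contended_intervals(demand_a, demand_b, capacity):
    """Return the indices of intervals where combined demand exceeds capacity."""
    return [
        i
        for i, (a, b) in enumerate(zip(demand_a, demand_b))
        if a + b > capacity
    ]


def proportional_share(a, b, capacity):
    """Split capacity between two demands with no notion of priority.

    Assumption: when demand exceeds capacity, each job is scaled back
    by the same factor. When it does not, each job gets what it asked for.
    """
    total = a + b
    if total <= capacity:
        return a, b
    scale = capacity / total
    return a * scale, b * scale


# Hypothetical per-interval core demand for two jobs on an 8-core host.
job_a = [2, 2, 6, 2]
job_b = [1, 5, 5, 1]

spikes = contended_intervals(job_a, job_b, capacity=8)   # only interval 2 (6 + 5 = 11)
granted = proportional_share(6, 5, capacity=8)           # roughly 4.36 and 3.64 cores
```

In interval 2 neither job receives what it needs, even if one of them is far more important to the organization than the other.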
Second, each type of job has reasons for spiking or, in contrast, drastically reducing its use of resources. HBase, for instance, suffers resource swings for the same reasons as other databases. It might have a period of no queries, followed by a period of many simultaneous queries. A query might transfer just one record or millions of records. It might require a search through huge numbers of records—taking up disk I/O, network I/O, and CPU time—or be able to consult an index to bypass most of these burdens. And HBase can launch background tasks (such as compacting) when other jobs happen to be spiking, as well.
MapReduce jobs are unaffected by outside queries but switch frequently between CPU-intensive and I/O-intensive tasks for their own reasons. At the beginning, a map task opens files from the local disk or via HDFS and does seeks on disk to locate data. It then reads large quantities of data. The strain on I/O is then replaced by a strain on computing to perform the map calculations. During calculations, it performs I/O in bursts by writing intermediate output to disk. It might then send data over the network to the reducers. The same kinds of resource swings occur for reduce tasks and for Spark. Each phase can last seconds or minutes.
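The alternation between I/O-bound and CPU-bound phases can be illustrated by grouping utilization samples by their dominant resource. The samples below are invented for the example, not measurements from a real job, and the classification rule (whichever percentage is higher wins) is a deliberate simplification.

```python
from itertools import groupby


def dominant_resource(sample):
    """Label one (cpu_percent, io_percent) sample by its busier resource."""
    cpu, io = sample
    return "cpu" if cpu >= io else "io"


def phases(samples):
    """Collapse consecutive samples with the same label into (label, length) runs."""
    return [
        (label, len(list(run)))
        for label, run in groupby(samples, key=dominant_resource)
    ]


# Hypothetical (cpu_percent, io_percent) samples taken at regular intervals
# during one map task: read input, compute, spill to disk, compute, send output.
samples = [(10, 80), (15, 90), (85, 10), (90, 5), (60, 70), (88, 12), (20, 75)]

task_phases = phases(samples)
# [("io", 2), ("cpu", 2), ("io", 1), ("cpu", 1), ("io", 1)]
```

A fixed limit sized for the CPU-heavy runs leaves capacity idle during the I/O-heavy ones, and the reverse is also true.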
Figure 1-1 shows seven of the many statistics tracked by Pepperdata. Although Pepperdata tracks hardware usage for every individual process (container or